60 research outputs found

Performance Evaluation of Parallel Sparse Matrix–Vector Products on SGI Altix3700

Author: Akira Nishida
Akira Nukada
Hidehiko Hasegawa
Hisashi Kotakemori
Reiji Suda
Tamito Kajiyama
Publication venue
Publication date: 01/01/2008

Abstract. The present paper discusses scalable implementations of sparse matrix-vector products, which are crucial for high-performance solution of large-scale linear equations, on the SGI Altix3700, a cc-NUMA machine. Three storage formats for sparse matrices are evaluated, and scalability is attained by implementations that take the page allocation mechanism of the NUMA machine into account. The influence of the cache and memory bus architectures on the optimal choice of storage format is examined, and scalable converters between storage formats are shown to facilitate the use of higher-performance formats.
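
As an illustration of the kernel this abstract refers to, here is a minimal sequential sketch of a sparse matrix-vector product. The abstract does not name the three storage formats, so the compressed sparse row (CSR) layout used here is an assumption, and the function and argument names are hypothetical. The sketch does not model NUMA page placement or parallel execution.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form (assumed format).

    values  : nonzero entries of A, stored row by row
    col_idx : column index of each entry in `values`
    row_ptr : row_ptr[i]..row_ptr[i + 1] delimits the entries of row i
    x       : dense input vector
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Only the stored nonzeros of this row contribute to y[row].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Example matrix:
# [[2, 0, 1],
#  [0, 3, 0],
#  [4, 0, 5]]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
result = csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
```

Each row is processed independently, which is why this kernel is usually parallelized over rows.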

CiteSeerX

Crossref

Linpack evaluation on a supercomputer with heterogeneous accelerators

Author: Akira Nukada
Naoya Maruyama
Satoshi Matsuoka
Toshio Endo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2010

Abstract: We report Linpack benchmark results on the TSUBAME supercomputer, a large-scale heterogeneous system equipped with NVIDIA Tesla GPUs and ClearSpeed SIMD accelerators. Using all 10,480 Opteron cores, 640 Xeon cores, 648 ClearSpeed accelerators, and 624 NVIDIA Tesla GPUs, we achieved 87.01 TFlops, which ranks third in the world among heterogeneous systems. This paper describes the careful tuning and load-balancing methods required to achieve this performance. However, since the peak speed is 163 TFlops, the efficiency is 53%, which is lower than that of other systems. This paper also analyzes this gap from the perspective of system architecture.

CiteSeerX

Crossref

Efficient high-precision integer multiplication on the GPU

Author: Amor Margarita
Doallo Ramón
Matsuoka Satoshi
Nukada Akira
Pérez Diéguez Adrián
Publication venue: SAGE Journals
Publication date: 01/03/2022

Dieguez AP, Amor M, Doallo R, Nukada A, Matsuoka S. Efficient high precision integer multiplication on the GPU. The International Journal of High Performance Computing Applications. 2022;36(3):356-369. © The Author(s) 2022. Publisher: SAGE Publications. https://doi.org/10.1177/10943420221077964

[Abstract]: The multiplication of large integers, which has many applications in computer science, is an operation that can be expressed as a polynomial multiplication followed by a carry normalization. This work develops two approaches for efficient polynomial multiplication. The first tiles the classical convolution algorithm while taking advantage of new CUDA architectures; it is a novel approach that computes the multiplication using integers without loss of accuracy. The second is based on the Strassen algorithm, which multiplies large polynomials using the FFT operation, adapted to the fastest FFT libraries for current GPUs and working on the complex field. Previous studies reported that the Strassen algorithm is an effective implementation for “large enough” integers on GPUs. Additionally, most previous studies do not examine the implementation of the carry normalization, but this work describes a parallel implementation of this operation. Our results show the efficiency of our approaches for short, medium, and large sizes.

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work has been supported by the Ministry of Science and Innovation of Spain (PID2019-104184RB-I00), by the Galician Government and FEDER funds under the Consolidation Program of Competitive Reference Groups (UDC/GI-000265, ref. ED431C 2021/30), by the Consolidation Program of Competitive Research Units (ED431G2019/01), and by the FPU Program of the Ministry of Education of Spain (FPU14/02801). It is also partially supported by JST CREST [JPMJCR1303 and JPMJCR1687] and NVIDIA GPU Center of Excellence and conducted as research activities of AIST-TokyoTech Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL). Xunta de Galicia; ED431C 2021/3
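
The decomposition stated in this record's abstract, a polynomial multiplication followed by a carry normalization, can be sketched sequentially as follows. This is an illustrative sketch only, not the authors' GPU implementation: it uses the plain classical convolution (no tiling, no FFT), assumes nonnegative integers, and every function name and the base `10**4` are hypothetical choices.

```python
def to_limbs(n, base):
    """Split a nonnegative integer into base-`base` digits, least significant first."""
    limbs = []
    while n:
        n, r = divmod(n, base)
        limbs.append(r)
    return limbs or [0]


def poly_mul(a, b):
    """Classical convolution: coefficient k is the sum of a[i] * b[j] with i + j == k."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c


def carry_normalize(coeffs, base):
    """Propagate carries so that every coefficient becomes a valid digit in [0, base)."""
    digits = []
    carry = 0
    for coeff in coeffs:
        carry, digit = divmod(coeff + carry, base)
        digits.append(digit)
    while carry:
        carry, digit = divmod(carry, base)
        digits.append(digit)
    return digits


def from_limbs(limbs, base):
    """Reassemble an integer from its base-`base` digits."""
    return sum(d * base**i for i, d in enumerate(limbs))


def multiply(x, y, base=10**4):
    """Multiply two nonnegative integers via polynomial product plus carry normalization."""
    product = poly_mul(to_limbs(x, base), to_limbs(y, base))
    return from_limbs(carry_normalize(product, base), base)
```

The convolution step has no dependencies between output coefficients, whereas the carry step shown here is inherently sequential, which is the part the abstract says most earlier studies leave unexamined.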

Repositorio da Universidade da Coruña

Automatic Tuning Techniques for CUDA (CUDA版自動チューニング手法)

Author: Nukada Akira
額田彰
Publication venue
Publication date: 28/03/2013

Institutional Repositories DataBase (IRDB)

Efficient execution of multiple CUDA applications using transparent suspend, resume and migration

Author: Nukada Akira
額田彰
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 14/03/2019

Institutional Repositories DataBase (IRDB)

Modeling gather and scatter with hardware performance counters for Xeon Phi

Author: Nukada Akira
額田彰
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 14/03/2019

Institutional Repositories DataBase (IRDB)

Toward automatic performance tuning for numerical simulations in the SILC matrix computation framework

Author: Nukada Akira
額田彰
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 14/03/2019

Institutional Repositories DataBase (IRDB)

Hamming Color Code for Dense and Robust One-shot 3D Scanning

Author: Akira Nukada
Masaaki Mochimaru
Shuntaro Yamazaki
Publication venue
Publication date: 01/01/2011

We propose a novel color code, the Hamming color code, designed for rapid 3D shape acquisition using structured light projection. The Hamming color code has several properties that are desirable for practical 3D acquisition. First, the Hamming distance between adjacent colors is always 1, which makes color detection robust to color blending caused by defocusing, subsurface scattering, or chromatic aberration. Second, every substring of a certain length is guaranteed to be unique. In other words, the Hamming color code can be viewed as a subset of a de Bruijn sequence. Third, a one-dimensional coordinate can be encoded for each pixel, which enables dense 3D reconstruction from a single pattern projection. Thanks to the uniqueness and robustness of the substrings, the structured light can be decoded stably by dynamic programming. We have implemented parallel dynamic programming on the GPU, achieving a speed-up by a factor of 630 over the CPU-based implementation and video-rate 3D acquisition on commodity hardware. Several experiments have been conducted to demonstrate the stability and performance of our algorithm. Finally, we discuss the limitations and future directions of this work.
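
The first two properties listed in this abstract can be checked mechanically for any candidate color sequence. The sketch below is a verifier only, not the authors' code construction. It assumes each color is a 3-bit integer with one bit per RGB channel, which the abstract does not state, and all function names are hypothetical.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two color codes differ."""
    return bin(a ^ b).count("1")


def adjacent_distance_is_one(colors):
    """Property 1: every pair of neighboring colors differs in exactly one bit."""
    return all(hamming_distance(a, b) == 1 for a, b in zip(colors, colors[1:]))


def windows_are_unique(colors, length):
    """Property 2: no substring of the given length occurs twice."""
    seen = set()
    for start in range(len(colors) - length + 1):
        window = tuple(colors[start:start + length])
        if window in seen:
            return False
        seen.add(window)
    return True


# A 3-bit Gray code used as a small test sequence (illustrative, not from the paper).
gray = [0b000, 0b001, 0b011, 0b010, 0b110, 0b111, 0b101, 0b100]
gray_ok = adjacent_distance_is_one(gray) and windows_are_unique(gray, 2)
```

A sequence such as `[0, 1, 0, 1]` satisfies the first property but fails the second for length 2, because the window `(0, 1)` appears twice.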

CiteSeerX

Crossref